Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling

Authors

  • Graham Cormode
  • S. Muthukrishnan
  • Irina Rozenbaum
Abstract

Emerging data stream management systems face massive data distributions that arrive at high speed while only limited storage is available; they approach this challenge by summarizing and mining the distributions using samples or sketches. However, data distributions can be “viewed” in different ways. A data stream of integer values can be viewed either as the forward distribution f(x), i.e., the number of occurrences of x in the stream, or as its inverse, f⁻¹(i), which is the number of items that appear i times. While both such “views” are equivalent in stored data systems, over data streams that entail approximations, they may be significantly different. In other words, samples and sketches developed for the forward distribution may be ineffective for summarizing or mining the inverse distribution. Yet many applications, such as IP traffic monitoring, naturally rely on mining inverse distributions. We formalize the problems of managing and mining inverse distributions and show provable differences between summarizing the forward distribution and the inverse distribution. We present methods for summarizing and mining inverse distributions of data streams: they rely on a novel technique to maintain a dynamic sample over the stream, which can be used for a variety of summarization tasks (building quantiles or equidepth histograms) and mining tasks (anomaly detection: finding heavy hitters, and measuring the number of rare items), all with provable guarantees on the quality of approximations and the time/space used by our streaming methods. We also complement our analytical and algorithmic results by presenting an experimental study of the methods over network data streams.

Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
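
The difference between the forward and inverse views described in the abstract can be illustrated with a small exact computation. The sketch below is only an offline illustration on a stored list, not the paper's dynamic inverse sampling method; the function names and the example stream are assumptions introduced here.

```python
from collections import Counter


def forward_distribution(stream):
    """f(x): number of occurrences of each item x in the stream."""
    return Counter(stream)


def inverse_distribution(stream):
    """f^-1(i): number of distinct items that occur exactly i times."""
    return Counter(forward_distribution(stream).values())


# Hypothetical stream of integer values (illustration only).
stream = [3, 7, 3, 9, 7, 3, 5]

f = forward_distribution(stream)      # item  -> occurrence count
f_inv = inverse_distribution(stream)  # count -> number of distinct items

for i in sorted(f_inv):
    print(f"{f_inv[i]} item(s) appear {i} time(s)")
```

On stored data both views come from the same exact counts, which is why they are equivalent there. A streaming summary cannot keep every count, so a sample that represents f(x) well need not represent f⁻¹(i) well.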

Similar resources

Comparing ordinary kriging and advanced inverse distance squared methods based on estimating coal deposits; case study: East-Parvadeh deposit, central Iran

Finding a proper estimation method for ore resources/reserves is important in mining engineering. The aim of this work is to compare the Ordinary Kriging (OK) and Advanced Inverse Distance Squared (AIDS) methods based on the correlation between the raw and estimated data in the East-Parvadeh coal deposit, central Iran. The variograms and anisotropic ellipsoids are calculated to estimate the ash...

EMPIRICAL BAYES ANALYSIS OF TWO-FACTOR EXPERIMENTS UNDER INVERSE GAUSSIAN MODEL

A two-factor experiment with interaction between factors wherein observations follow an Inverse Gaussian model is considered. Analysis of the experiment is approached via an empirical Bayes procedure. The conjugate family of prior distributions is considered. Bayes and empirical Bayes estimators are derived. Application of the procedure is illustrated on a data set, which has previously been an...

Algorithmic Techniques for Processing Data Streams

We give a survey of some algorithmic techniques for processing data streams. After covering the basic methods of sampling and sketching, we present more evolved procedures that build on those basic ones. In particular, we examine algorithmic schemes for similarity mining, the concept of group testing, and techniques for clustering and summarizing data streams. 1998 ACM Subject Classification F...

An Hybrid Data Stream Summarizing Approach by Sampling and Clustering

Computer systems generate a large amount of data that, in terms of space and time, is very expensive even impossible to store. Besides this, many applications need to keep an historical view of such data in order to provide historical aggregated information, perform data mining tasks or detect anomalous behavior in the computer systems. One solution is to treat the data as streams that can be p...

Summarizing and Mining Skewed Data Streams

Many applications generate massive data streams. Summarizing such massive data requires fast, small space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, z, that captures the amoun...


Publication date: 2005